Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


design matrix using the sample information in the “sampleinfo.txt” file that we created above. In edgeR, the design matrix can be defined with or without an intercept. An intercept is used when one condition serves as the reference (baseline) for the differential expression analysis. When the design matrix is defined without an intercept, the groups to be compared are specified explicitly with a contrast, as we will do. In the following, we define a design matrix without an intercept (Figure 5.9):

condition <- factor(sampleinfo$condition)

design <- model.matrix(~ 0 + condition)

design

This design matrix defines two dummy variables, one for each level of the condition studied: a variable takes the value 1 if the sample belongs to that condition and 0 otherwise. When we fit the negative binomial generalized log-linear model described in Formula 22, two coefficients will be estimated, one for each dummy variable.
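The difference between the two parameterizations can be seen with a small sketch in base R. The sample table and the condition labels ("ctrl" and "treat") are hypothetical stand-ins for the contents of “sampleinfo.txt”, and the contrast vector is written by hand only to illustrate the idea of comparing one coefficient against the other:

```r
# Hypothetical sample table standing in for sampleinfo.txt
sampleinfo <- data.frame(
  sample    = c("s1", "s2", "s3", "s4"),
  condition = c("ctrl", "ctrl", "treat", "treat")
)
condition <- factor(sampleinfo$condition)

# Without an intercept: one indicator (dummy) column per condition level
design_no_int <- model.matrix(~ 0 + condition)

# With an intercept: the first level ("ctrl") becomes the reference,
# and the second column measures the treat-versus-ctrl difference
design_int <- model.matrix(~ condition)

# Hand-written contrast for the no-intercept design: treat minus ctrl
contrast <- c(conditionctrl = -1, conditiontreat = 1)

print(design_no_int)
print(design_int)
print(contrast)
```

In the no-intercept design each row contains exactly one 1, marking the condition of that sample, so the comparison of interest has to be stated through the contrast. In the intercept design the comparison against the reference level is built into the second coefficient.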

5.3.7.4 Filtering Low-Expressed Genes

Some genes may not be expressed or may not have enough reads to contribute to the differential analysis. Therefore, it is good practice to retain only the genes that have sufficient read counts and to filter out the genes with zero or low counts. A common rule of thumb is to keep only the genes with at least one count per million (1 cpm) reads in at least two samples. The filterByExpr() function applies a comparable criterion automatically, choosing the count threshold and the minimum number of samples from the library sizes and the design matrix. The following script filters out the genes with low abundance and adjusts the library sizes to reflect the change:

keep <- filterByExpr(y, design)

y <- y[keep, , keep.lib.sizes=FALSE]

As shown in Figure 5.10, after filtering, the counts slot contains only the genes with sufficient abundance, and the library sizes in the samples slot have been recalculated from the retained counts. Notice the difference in the number of genes and in the library sizes between Figures 5.10 and 5.6: the new counts slot contains only 133 genes, compared to 632 genes before filtering.
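To make the 1 cpm rule of thumb concrete, the following is a minimal sketch in base R on a made-up count matrix. The gene names, sample names, and counts are hypothetical, and the code applies the manual rule only; it is not a reimplementation of filterByExpr(), which uses its own default thresholds:

```r
# Hypothetical count matrix: 5 genes x 4 samples
counts <- matrix(
  c(      0,       0,       0,       0,   # geneA: not expressed
          2,       0,       3,       1,   # geneB: below 1 cpm everywhere
    3000000, 3200000, 2900000, 3100000,   # geneC: highly expressed
    2000000, 1800000, 2100000, 1900000,   # geneD: highly expressed
          8,       2,       1,       9),  # geneE: above 1 cpm in two samples
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:5]), paste0("s", 1:4))
)

# Library size = total reads per sample
lib_size <- colSums(counts)

# Counts per million: divide each column by its library size, scale by 1e6
cpm <- t(t(counts) / lib_size) * 1e6

# Keep genes with at least 1 cpm in at least two samples
keep <- rowSums(cpm >= 1) >= 2
filtered <- counts[keep, , drop = FALSE]

# Recompute library sizes from the retained genes, which is the effect
# of keep.lib.sizes=FALSE when subsetting a DGEList
new_lib_size <- colSums(filtered)

print(keep)
print(new_lib_size)
```

With these toy counts, geneA and geneB are dropped and the recomputed library sizes are slightly smaller than the originals, mirroring the change seen between Figures 5.6 and 5.10.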

FIGURE 5.9 Design matrix without intercept.